Primary Goal:
Perform QC analysis on a single cell data set using the R Seurat package.
Perform QC analysis on a single cell data set using the R Seurat package.
Use the Athena terminal to set up your lab directory and obtain needed files.
In the event the cluster is down, use Canvas to download the needed files into a directory on your laptop and complete the lab locally.
To get help using any command, type “?” followed by the command to see the documentation. E.g. “?plot”
# create and navigate into a new directory for this lab
mkdir scRNASeq
cd scRNASeq
# Copy feature count directory
cp -r /lustre/home/git_repos/CCTR691/scRNASeq/filtered_feature_bc_matrix_WHIM2_106361_HumanOnly .
# Copy supporting files
cp /lustre/home/git_repos/CCTR691/scRNASeq/MitoCodingGenes13_human.txt .
cp /lustre/home/git_repos/CCTR691/scRNASeq/UCD52CR_107086_web_summary.html .
This Rmd file and its associated HTML output will be submitted to GitHub. The GitHub URL will be submitted to Canvas.
The majority of this lab will be completed in the Rmarkdown template EXCEPT this first section.
Download the “UCD52CR_107086_web_summary.html” file to your laptop and open it in a web browser, then answer the first set of questions in the RMarkdown template.
Q1) Was this sample aligned to a SINGLE or MULTIPLE genomes?
This sample was aligned to the human hg19 genome and mouse mm10 genome, thus, multiple.
Q2) With the knowledge that we expect about 5000 cells from this sample, list at least 2 red flags identified by this report. Be sure to explore both tabs.
There seems to be a large number of multiplets, bumping the cell count up from the expected 5000 to 17,000. Also, only 50% and 11% of the reads were mapped during alignment/mapping to the human and mouse genomes, respectively. Reads per cell is quite elevated as well - likely due to the multiplet data. There is also no “knee” in the UMI count to barcodes plot. The distribution currently indicates there isn’t a clear separation between what could be “nonviable cells” and “viable cells” due to the shear number of multiplets present within the data. t-SNE does not show good separation of clusters.
Follow the instructions and code to remove dead cells, generate violin plots, cluster, and create tSNE plots of the sample data.
The sample data you are using is publicly available on GEO (Gene Expression Omnibus) in the Series GSE174391, only it is merged with other samples. For this lab I am providing you with the single WHIM2 PDX sample that has already had the mouse cells removed (so you only have human data).
Follow the steps below and answer the questions. This lab is based off of, but not exactly the same as, the Seurat introductory tutorial. Feel free to explore that tutorial for more in-depth information.
Use the File navigation pane to navigate to your working lab directory that contains this and the other files.
Select the blue gear and click “Set As Working Directory”.
Look in the console for a setwd() command. COPY the path it uses to the code block below to set your working directory globally for this workbook.
Click the green “play” button to execute the code chunk.
knitr::opts_knit$set(root.dir = "~/Downloads/CCTR691_scRNASeq")
library(Seurat)
## Loading required package: SeuratObject
## Loading required package: sp
## 'SeuratObject' was built under R 4.5.0 but the current version is
## 4.5.1; it is recomended that you reinstall 'SeuratObject' as the ABI
## for R may have changed
##
## Attaching package: 'SeuratObject'
## The following objects are masked from 'package:base':
##
## intersect, t
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Load the WHIM2 dataset into Seurat:
whim.data <- Read10X(data.dir = "./filtered_feature_bc_matrix_WHIM2_106361_HumanOnly")
#Initialize the Seruat Object:
whim <- CreateSeuratObject(counts = whim.data, project = "whim2", min.cells = 3, min.features = 200)
#View a summary of the object:
whim
## An object of class Seurat
## 19612 features across 2575 samples within 1 assay
## Active assay: RNA (19612 features, 0 variable features)
## 1 layer present: counts
In the environment tab of the RStudio interface, click on the “whim” object to expand and explore it. Look for the “meta.data” section. This is the most important section as it is where your new annotations are stored, and it tells you the name of annotations added by clustering and other processing.
Q3) What metadata elements are listed in this section?
It lists the orig.ident (identify of the data for reference in code), nCount_RNA (number of reads), and nFeature_RNA (number of genes).
Q4) I told you this is 1 PDX sample, but the summary of the Seurat object says differently. How many samples does this object have? What is this actually telling you? What does the 19,612 number represent?
This object has 2575 samples, this is the cell count - not the number of samples the data was originally derived from. The 19,612 is feature count meaning the number of genes identified across all of these cells.
# Load in the mitochondrial gene IDs:
mitogene_ids <- read.delim("./MitoCodingGenes13_human.txt", header = FALSE, stringsAsFactors = FALSE)[[1]]
# Add the percentage of mito expression as an annotation to your Seurat object:
whim[["percent.mt"]] <- PercentageFeatureSet(whim, features = mitogene_ids)
# Create a Violin plot of the mito expression, nFeature and nCount data
# These last 2 were already in the dataset by default, so we didn’t need to add them.
VlnPlot(whim, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
## Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
## ℹ Please use `rlang::check_installed()` instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Q5) What is a reasonable filtering criteria for the percent of mitochondria expression?
Typically, a 25% cut off is used, however, it comes down to the nature of the dataset. For example, certain cancers have excess mitochondrial activation, thus, it’s likely for this one that a cut off of ~30-35% should be used.
Sub set your cells to remove those that do not pass your mitochondrial cutoff. Make sure to replace the “ADD_YOUR_CUTOFF_HERE” with the appropriate expression (i.e. “> x” or “< x” where “x” is the threshold you specified in Q5 - do not include the quotation marks).
# subset the seurat object for only cells passing your cutoff and print out the new violin plot
whim_filtered <- subset(whim, subset = percent.mt < 25)
# print new violin plot
VlnPlot(whim_filtered, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3, pt.size = .4)
## Warning: Default search for "data" layer in "RNA" assay yielded no results;
## utilizing "counts" layer instead.
Note that we now have two (2) Seurat data sets, “whim” contains the full unfiltered dataset, and “whim_filtered” contains the filtered dataset.
Q6) How many cells survived the filtering? Print out the new “whim_filtered” object to figure this out.
whim_filtered
## An object of class Seurat
## 19612 features across 1314 samples within 1 assay
## Active assay: RNA (19612 features, 0 variable features)
## 1 layer present: counts
Only 1314 cells survived the filtering.
Q7) Compare the 2 plots, before and after. What can you conclude about the cells that did not make the mitochondrial cutoff in terms of nFeature and nCount? If you want to print both plots in the same R code chunk to compare more easily you can do that.
After filtering based on mitochondrial status, ther number of cells with low features (low genes) and low counts (reads) is far less than prior to filtering based on mitochondrial status. This likely indicates that low quality cells (i.e. dead or dying cells) were removed.
You can continue to filter on nFeature and nCount, and I generally do, but for the purposes of this lab we will leave the dataset as-is and move on to normalization and clustering.
Before we can cluster the data and create those cool tSNE and UMAP visualizations, we need to normalize and scale the data, and run a PCA analysis. However, this can take a long time if we use all genes. Thus, we will reduce the set used to the most variable genes (don’t worry, the rest are still there).
# Using the “whim_filtered” Seurat data set, normalize the data.
whim_filtered <- NormalizeData(whim_filtered)
## Normalizing layer: counts
# Before creating the visualizations, we need to reduce the dataset down to only
# those genes with the most variance. These will be our top 2000 most variable features.
whim_filtered <- FindVariableFeatures(whim_filtered, selection.method = "vst", nfeatures = 2000)
## Finding variable features for layer counts
top10 <- head(VariableFeatures(whim_filtered), 10)
# Plot the top10 most variable genes as a scatter plot and mark the top 10
plot1 <- VariableFeaturePlot(whim_filtered)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot2
## Warning in scale_x_log10(): log-10 transformation introduced infinite values.
Q8) Which gene is the most variable?
SCGB2A2
What does scaling do? * Shifts the expression of each gene, so that the mean expression across cells is 0 * Scales the expression of each gene, so that the variance across cells is 1 * This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
Principle Component Analysis is a dimension reduction technique that transforms large data sets with a lot of dimensions into a data set with fewer dimensions, but still retains the majority of the information present, such as variability, in the initial data set. In single cell RNASeq it is used as a first step to identify the most informative principle components that should be used for data visualization by UMAP and tSNE.
# scale data
whim_filtered <- ScaleData(whim_filtered)
## Centering and scaling data matrix
# run PCA
whim_filtered <- RunPCA(whim_filtered, features = VariableFeatures(object = whim_filtered))
## PC_ 1
## Positive: MGP, SCGB3A1, PHLDA1, LBH, SCGB2A2, EMP1, CX3CL1, PLCG2, CEBPD, PXDC1
## PIP, KRT16, ORAI3, ZG16B, INHBA, LINC02613, SCGB2A1, DNAH11, DHRS3, TNFSF10
## MIR210HG, CRYAB, FMO2, KRTAP5-8, RNASE7, LEMD1, NDRG1, CALML5, IGFBP5, CA9
## Negative: UBE2T, CENPF, ASPM, KIFC1, H2AFV, GMNN, ESCO2, TUBA1B, HIST1H4C, TMEM106C
## SIVA1, MKI67, NUSAP1, RRM2, STMN1, PSIP1, TPX2, DEK, LBR, KIF23
## ANP32E, UBE2C, TUBB, BIRC5, NASP, AURKB, DHTKD1, KIF4A, PCLAF, ANLN
## PC_ 2
## Positive: NEAT1, IER3, KLF6, MALAT1, NDRG1, AHNAK, CEBPD, SAT1, VEGFA, PNRC1
## LGALS3, TFCP2L1, ELF3, FBXO32, COL6A2, P4HA2, USP53, KCNQ1OT1, OLMALINC, ATF3
## ZFP36L1, RRAGD, ERO1A, BNIP3L, DUSP4, SOX4, C4orf3, BNIP3, ANKRD37, ZFP36L2
## Negative: H2AFZ, UBE2C, TOP2A, PBK, UBE2S, CCNB1, NUSAP1, PTTG1, TUBA1B, MKI67
## CDCA8, HMGB1, CENPA, HMGB2, CENPF, STMN1, BIRC5, HMMR, TPX2, CCNA2
## NUF2, UBE2T, RRM2, CDK1, MAD2L1, CENPW, ASPM, AURKA, AURKB, PLK1
## PC_ 3
## Positive: HMGB2, TOP2A, NUSAP1, DLGAP5, MKI67, CENPA, CCNA2, TPX2, PFKP, NUF2
## NDRG1, KIF4A, HMMR, UBE2C, HIST1H1B, AURKB, CAMK1D, ASPM, CENPE, PIF1
## CENPF, ERO1A, NDC80, CDKN3, KNL1, NEK2, GTSE1, BUB1, CDCA3, PBK
## Negative: SELENBP1, AZGP1, OAZ3, PAK1IP1, NES, IFITM3, LY6E, EEF1E1, SRM, HACD3
## CITED4, FKBP1A, NME1, SORBS2, NFIA, ROPN1, STAC2, NDUFAF4, SDAD1, TSC22D3
## ZG16B, PLA2R1, SDF2L1, NPM3, ATP5PB, E2F1, MTX1, SELENOW, KCNQ1OT1, PPAN
## PC_ 4
## Positive: AQP3, FAM3D, COL9A3, CCL28, KRT14, CDK6, KLK7, TPT1-AS1, SLC34A2, RAB11FIP1
## CXCL17, KLK10, NCCRP1, LBH, KLF5, S100P, INHBA, KLF6, SULF2, ANXA1
## CEBPD, ATF3, BTG2, CEACAM1, KRT16, CALML5, SCGB3A1, ADD3, NR4A1, KLK6
## Negative: EGLN3, PFKP, ANGPTL4, NRN1, CAMK1D, IGFBP5, SLC16A3, CHI3L1, NDRG1, LOX
## P4HA1, SLC2A1, FZD8, CA9, VEGFA, AKR1C2, ERO1A, NCAN, FAM162A, STC1
## EGFR, BNIP3, CSF3R, SCGB2A2, SCGB1D2, AKR1C3, WDR54, COL23A1, CD5L, ALDOC
## PC_ 5
## Positive: MCM4, DTL, HELLS, MCM10, HSD17B7, CLSPN, FAM111B, MCM3, E2F7, ATAD2
## MSH6, MCM6, CCNE1, XRCC2, CCNE2, ESCO2, CDC6, NCOA7, HIST1H1B, LDLR
## ARHGAP10, SMC1A, RRM2, OLMALINC, HIST1H4C, POLR2J3, MMS22L, NR4A2, NEAT1, STAC2
## Negative: ARL6IP1, LGALS1, PTTG1, S100A16, CDC20, CCNB2, S100A10, NMU, CDKN3, CCNB1
## HMGN2, H2AFV, HMGB3, TMSB4X, ANXA3, ANXA2, SNRPG, TUBA1A, JPT1, PSMA7
## KLK5, NQO1, MIEN1, KRT14, TAGLN2, LCN2, LGALS3, RBP1, HIST3H2A, KLK6
# create heatmap
DimHeatmap(whim_filtered, dims = 1:15, cells = 500, balanced = TRUE)
Q9) What are the axes of each heatmap?
The y-axis is genes, x-axis is cell numbers (for this case, there’s 500)
Q10) How many cells are used in each heatmap?
500
Q11) Describe what happens when the principal component (PC) number gets higher.
The variance between differentially expressed genes decreases as the principal component number gets higher. That’s because the first eigenvector always captures the greatest variance between data points in higher dimensional space and it continues to decrease as the eigenvector (principal component) increases.
To visualize single cell data we need to use enough PCs to represent
the variability in the data without adding too much noise or having too
little. An Elbow plot can help with this along with looking at the
DimHeatmap plots. We need to pick a threshold close to the elbow in the
plot. This indicates where not much more information is being
added.
For the plot generated by the command below it looks to be around 5;
however, the default is usually 15, which is a good cutoff for most
datasets.
# draw elbow plot
ElbowPlot(whim_filtered)
#Create a tSNE plot using the first 5 dimensions.
whim_filtered <- RunTSNE(whim_filtered, dims = 1:5)
DimPlot(whim_filtered, reduction = "tsne")
# add additional code here based on Question 12
Q12) In the code block above, copy the 2 lines of code that generated the tSNE plot to answer the following questions. Note I am expecting at least 4 plots.
What happens when you use fewer dimensions?
#Create a tSNE plot using the first 2 dimensions.
whim_filtered <- RunTSNE(whim_filtered, dims = 1:2)
DimPlot(whim_filtered, reduction = "tsne")
The distance between points in the t-SNE increases when you use less dimensions because you are utilizing a lower number of principal component vectors.
How about only the first dimension?
#Create a tSNE plot using the first dimension.
whim_filtered <- RunTSNE(whim_filtered, dims = 1)
DimPlot(whim_filtered, reduction = "tsne")
The clusters that form are only based on distances between points along a single vector, so we see strings rather than clusters.
What happens when you use more dimensions?
#Create a tSNE plot using the first 30 dimensions.
whim_filtered <- RunTSNE(whim_filtered, dims = 1:50)
DimPlot(whim_filtered, reduction = "tsne")
The definition of clustering increases! That’s because we’re utilizing a greater number of principal components to perform the t-SNE graphs.
This flat pink plot is pretty boring. Let’s add some color by coloring each cell with the expression level of the 2 top most variable genes you identified in Q8.
# change dimensions back to 1:5
whim_filtered <- RunTSNE(whim_filtered, dims = 1:5)
# create the feature plot
FeaturePlot(whim_filtered, features = c("SCGB2A2", "LALBA"), reduction = "tsne")
You should get 2 plots with the most highly expressed gene grouped into a single location. This could be a cluster of cells with similar gene expression profiles. To find out, let’s cluster the dataset, and then color the tSNE plot by cluster.
# find nearest neighbors and cluster cells
whim_filtered <- FindNeighbors(whim_filtered, dims = 1:5)
## Computing nearest neighbor graph
## Computing SNN
whim_filtered <- FindClusters(whim_filtered, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 1314
## Number of edges: 39301
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8145
## Number of communities: 5
## Elapsed time: 0 seconds
# overlay cluster colors onto dim plot
DimPlot(whim_filtered, reduction = "tsne", group.by = "seurat_clusters")
Q13) Which cluster includes the 2 genes you plotted in Q11?
All of the clusters.
Now we can explore each gene one at a time, which is not very
efficient.
Next we will run an all-by-all differential expression analysis to
identify the top 20 gene markers for each cluster.
# find marker genes for each cluster
whim_filtered.markers <- FindAllMarkers(whim_filtered, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Warning: The `slot` argument of `GetAssayData()` is deprecated as of SeuratObject 5.0.0.
## ℹ Please use the `layer` argument instead.
## ℹ The deprecated feature was likely used in the Seurat package.
## Please report the issue at <https://github.com/satijalab/seurat/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## For a (much!) faster implementation of the Wilcoxon Rank Sum Test,
## (default method for FindMarkers) please install the presto package
## --------------------------------------------
## install.packages('devtools')
## devtools::install_github('immunogenomics/presto')
## --------------------------------------------
## After installation of presto, Seurat will automatically use the more
## efficient implementation (no further action necessary).
## This message will be shown once per session
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
# identify the top 20 marker genes for each cluster
top20 <- whim_filtered.markers %>% group_by(cluster) %>% top_n(n = 20, wt = avg_log2FC)
# plot marker gene expression for each cluster
DoHeatmap(whim_filtered, features = top20$gene)
## Warning in DoHeatmap(whim_filtered, features = top20$gene): The following
## features were omitted as they were not found in the scale.data slot for the RNA
## assay: VPS13B, RELB, TRMT2B, PPP1R12C, TCF19, EML1, ERCC6L2, FA2H, ETFDH,
## BTBD1, PHKG1, WASHC2A, DONSON, ZMIZ1, PPP2R5D, ANAPC2, MYO1E, ADAM10,
## STX17-AS1, USP54, PTK6, FUT2, HERC6, TYMSOS, CIT, DTD2, SWSAP1, YTHDF3-AS1,
## ARHGAP42, ZNF480, MAP9, METAP1D, TGFBR2, MLYCD, ASB6, ZNF454, TCTA, ENG, PSEN2
Q14) What is the x-axis for in this plot?
Cells.
Q15) Which 2 clusters have the most distinct expression profile from the rest of the clusters?
Cluster 1 and 4.
Finally, let’s export your list of differential expressed genes to a
text file.
First filter base on significant p-value, then save to a file. Column
explanations are as follows:
# filter on adjusted pval
whim_filtered.markers <- whim_filtered.markers[whim_filtered.markers$p_val_adj <= 0.001,]
# save to a CSV file
write.csv(whim_filtered.markers, file="whim_filtered_SignificantMarkerGenes.csv", quote = FALSE)
Either through Athena, or by downloading this file, open the CSV file and answer the following question.
Q16) What is the log2FC of the top most variable gene you identified in Q8?
LALBA: 1.0294179269377; SCGB2A2: 2.79986898143682
Finally, save this file, then knit your HTML report for submission
via GitHub.
Go back to the Word document for github instructions!